Sequence assembly

ABSTRACT

The invention relates to assembly of sequence reads. The invention provides a method for identifying a mutation in a nucleic acid involving sequencing nucleic acid to generate a plurality of sequence reads. Reads are assembled to form a contig, which is aligned to a reference. Individual reads are aligned to the contig. Mutations are identified based on the alignments to the reference and to the contig.

FIELD OF THE INVENTION

The invention relates to assembly of sequence reads.

BACKGROUND

Methods to sequence or identify significant fractions of the humangenome and genetic variations within those segments are becomingcommonplace. However, a major impediment to understanding healthimplications of variations found in every human being remains unravelingof the functional meaning of sequence differences in individuals.Sequencing is an important first step that allows geneticists andphysicians to develop a full functional understanding of that data.

Next-generation sequencing (NGS) technologies include instrumentscapable of sequencing more than 10¹⁴ kilobase-pairs (kbp) of DNA perinstrument run. Sequencing typically produces a large number ofindependent reads, each representing anywhere between 10 to 1000 basesof the nucleic acid. Nucleic acids are generally sequenced redundantlyfor confidence, with replicates per unit area being referred to as thecoverage (i.e., “10× coverage” or “100× coverage”). Thus, a multi-genegenetic screening can produce millions of reads.

When a genetic screening is done for a person, the resulting reads canbe compared to a reference, such as a published human genome. Thiscomparison generally involves either assembling the reads into a contigand aligning the contig to the reference or aligning each individualread to the reference.

Assembling reads into a contig and aligning the contig to a referenceproduces unsatisfactory results due to the algorithms used for contigassembly. Generally, algorithms for contig assembly assess a read usingcertain quality criteria. Those criteria set a threshold at whichcertain reads that satisfy the algorithm are determined to be legitimatereads that are used to assemble the contigs, while reads that do notsatisfy the algorithm are excluded from the contig assembly process.Based on the threshold level of the algorithm, as many as 10% oflegitimate sequence reads are excluded from further analysis.

Additionally, aligning the resulting contig to a reference is errorprone due to a tradeoff, inherent in doing an alignment, between whethermismatches or gaps (insertions/deletions, or “indels”) are favored. Whenone sequence is aligned to another, if one sequence does not match theother perfectly, either gaps must be introduced into the sequences untilall the bases match or mismatched bases must appear in the alignment.Existing approaches to alignment involve algorithms with good mismatchsensitivity at the expense of indel sensitivity or good indelsensitivity at the expense of mismatch sensitivity. For example, if analignment is to detect mismatches with sufficient fidelity, then it islikely that some indels will be missed.

Even where assembly provides accurate detection of variants (e.g.,substitutions or indels), these methods are often computationallyintractable for high throughput data analysis because each read must becompared to every other read in a dataset to determine sequence overlapand build contigs.

Another sequence assembly technique involves aligning each individualread to a reference. This assembly technique is problematic because veryshort reads (e.g., 50 bp or less) may align well in a number of placeson a very long reference (e.g., 5 million bp). With a number of equallygood positions to align to, aligning a read to a reference offers littlepositional accuracy. Also, particularly with very short reads, longindels can be difficult or impossible to detect. Due in part to thetradeoff between substitution sensitivity and indel sensitivity, certainmutation patterns are particularly difficult to detect. Indels near theends of reads are sometimes incorrectly interpreted as short strings ofmismatched bases. Substitutions near indels are often interpretedincorrectly, as well.

Existing methods of read assembly do not offer the positional accuracyof a contig-based alignment while including detailed information fromeach read. Further, due to limitations in alignment algorithms, existingmethods do a poor job of correctly interpreting certain mutations (e.g.,indels near the ends of reads, substitutions near indels).

SUMMARY

The invention generally relates to genotyping nucleic acids throughmethods of assembling and aligning sequence reads. Methods of theinvention combine a contig based sequence assembly approach with anindividual alignment based sequence assembly approach in order to moreaccurately assemble sequence reads. Methods of the invention areaccomplished by assembling contigs from sequence reads, aligning thecontigs to a reference sequence, aligning the reads back to the contig,and identifying mutations via the assembled contig and the alignments.By assembling reads into contigs as well as aligning the individualreads to the contigs, the need to compare each of the reads to all ofthe others is avoided, providing computational tractability even forvery high throughput analyses. Information about variants in the sampleis identified through the alignment of the contigs to the reference.After aligning each raw read, the alignment of the read to the contig isused to map positional information and any identified differences (i.e.,variant information) from the reference to the raw read. The raw read isthen translated to include positional and variant information, allowinggenotyping to be performed using the aligned translated reads.

By these methods, positional accuracy of reads is obtained and thelimitations of a tradeoff between substitution sensitivity and deletionsensitivity are overcome. By combining data in this way, an accurate andsensitive interpretation of the nucleic acid is obtained and an accuratedescription of a genotype including an identity and a location of amutation on an organism's genome is reported.

Generally, nucleic acid obtained from a subject is sequenced to producea plurality of reads. Sequencing the nucleic acid may be by any methodknown in the art. For example, sequencing may involve pyrosequencing,sequencing-by-synthesis, reversible dye-terminators, or sequencing byhybridization.

The reads produced by sequencing are assembled into one or more contigs.Assembly into contigs may be by any method including: collection intosubsets based on, for example, barcodes, primers, or probe sets; aheuristic method; or any traditional assembly algorithm known in theart. For example, assembly de novo can include an assembly using agreedy algorithm. A computer processor can select the two reads from aplurality that exhibit the most overlap, join them to produce a newread, and repeat, thereby assembling the reads produce an contig. Incertain embodiments, assembly includes positioning the reads relative toeach other by making reference to a reference genome. This occurs where,for example, the subject nucleic acid is sequenced as captured bymolecular inversion probes, because the start of each read derives froma genomic position that is known and specified by the probe set design.

Where a number of different probe sets are used, the full set of rawreads can be organized into subsets according to an associated probe.The number of reads that must be cross-compared is reducedsubstantially, providing computational tractability for very highthroughput data sets. In some embodiments, extrinsic knowledge of probeposition (i.e., relative offset in reference genome) is used to aid inpositioning the reads. By collecting each read per probe and/orpositioning each read according to a known relative offset, reads can beassigned to contigs a priori, further improving specificity ofpositional information.

In some embodiments, the reads may contain barcode information. Barcode,generally, refers to any sequence information used for identification,grouping, or processing. For example, barcode information may beinserted into template nucleic acid during sample prep or sequencing.Barcodes can be included to identify individual reads, groups of reads,subsets of reads associated with probes, subsets of reads associatedwith exons, subsets of reads associated with samples or any other group,or any combination thereof. For example, by sequencing sample andbarcode information, reads can be sorted (e.g., using a computerprocessor) by sample, exon, probe set, etc., or combination thereof,through reference to the barcode information. The reads may be assembledinto contigs by reference to the barcode information. A computerprocessor can identify the barcodes and assemble the reads bypositioning the barcodes together.

Reads are assembled into contigs. In the case of a single, homozygoustarget nucleic acid, a single contig per location is typically produced.In other cases such as, for example, multiple patient multiplexedanalyses, heterozygous targets, a rare somatic mutation, or a mixedsample, two or more contigs can be produced. Each contig can berepresented as a consensus sequence, i.e., a probable interpretation ofthe sequence of the nucleic acid represented by that contig.

Each contig is aligned to a reference genome. Where, for example, twocontigs were found for a given exon, both align to the same referenceposition, but one may contain variant information that the other doesnot. In certain embodiments, positioning a contig along a referencegenome involves indexing the reference, for example, with aBurrows-Wheeler Transform (BWT). Indexing a reference can be done bycomputer programs known in the art, for example, Burrows-Wheeler Aligner(BWA) available from the SourceForge web site maintained by Geeknet(Fairfax, Va.) using the bwa-sw algorithm. Contigs may then be alignedto the indexed reference. For example, where the bwa-sw algorithm isimplemented by BWA, parameters of the alignment are optimized to ensurecorrect placement of the contig on the overall reference genome. In someembodiments, protein-coding sequence data in a contig, consensussequence, or reference genome, is represented by amino acid sequence anda contig is aligned to a reference genome with reference to the aminoacid sequence.

With each contig aligned to the reference genome, any differencesbetween the contig and the reference genome can be identified. Since theentire contig with contiguous sequence from many overlapping reads ispositioned against the reference genome, the added sequence contextincreases the likelihood that the correct position is identified wherebyvariant information (e.g., mismatches and gaps) can be transferredtransiently and with high fidelity from reference to contig and fromcontig to reads. Thus, any indels introduced here likely representdifferences between the nucleic acid and the reference. Each mutationidentified here can be described with reference to the reference genome.In certain embodiments, a computer program creates a file or variablecontaining a description of the mutation (e.g., a compact idiosyncraticgapped alignment report (CIGAR) string).

With the contigs aligned, each raw read is aligned to a contig in alocal, contig-specific alignment. In certain embodiments, a pairwisealignment is performed between the read and the contig. In someembodiments, all reads align as a perfect match to at least one contig.For example, BWA with bwa-short can be used to align each read to acontig, and the output can be a Binary Alignment Map (BAM) including,for example, a CIGAR string. A file or variable can record the resultsof the local, contig-specific alignment (e.g., as a CIGAR string).

Each raw read-to-contig alignment is mapped to the reference using thecorresponding contig-to-reference alignment. By using the contig as avehicle, positional and variant information relative to the reference isprovided for each read, allowing each raw read alignment to betranslated to include positional and variant information relative to thereference genome.

Genotyping is performed using the translated, aligned reads. Raw qualityscores (e.g., for substitutions) can be included. The output of thelocal alignment, describing the read compared to the contig, can becombined with the output of the reference alignment, describing thecontig compared to the reference. This combination gives, for anymutation detected in the nucleic acid, a description of that mutationrelative to the reference genome. Wild type and mutant alleles includingspecific mutations can be identified. Mutation patterns previouslythought to pose particular difficulty (e.g., long indels, indel-proximalsubstitutions, and indels near the ends of reads) are identified withfidelity. Methods of the invention can perform, with high-throughputdata using existing computer power, sequencing and genotyping analysesthat were previously computationally intractable.

By combining information in this way, the limitations of a tradeoffbetween substitution sensitivity and deletion sensitivity is overcome.The output includes an accurate and sensitive interpretation of thesubject nucleic. This provides an accurate description of a genotypeincluding an identity and a location of a mutation on an organism'sgenome.

The output can be provided in the format of a computer file. In certainembodiments, the output is a VCF file, FASTA file, XML file, or similarcontaining sequence data such as a sequence of the nucleic acid alignedto a sequence of the reference genome. In certain embodiments, theoutput contains coordinates or a string describing one or more mutationsin the subject nucleic acid relative to the reference genome. In someembodiments, the output is a sequence alignment map (SAM) or binaryalignment map (BAM) file comprising a CIGAR string.

The output describes a mutation in the nucleic acid. Mutationsdetectable by methods of the invention include substitutions, singlenucleotide polymorphisms, insertions, deletions, translocations,epigenetic mutations, and copy number repeats. Methods of the inventiondetect and describe more than one mutation including, for example, twoor more mutations. Mutations can be detected that are near each other,for example less than about 1,000 nucleotides apart along a strand ofnucleic acid. In a preferred embodiment, methods of the invention candetect two mutations within about 100 bases of each other, such as, forexample, within about 10 bases or about 5 bases of each other. Methodsof the invention can detect two mutations of different types near eachother including, for example, a substitution near an indel. Mutationsnear the ends of sequence reads can be detected including, for example,an indel at or near an end of a read.

In certain embodiments, a sample is heterogeneous (e.g., includesheterogeneous tumor, circulating fetal, or other DNA) and at least someof the individual reads include variations relative to the contig.Mutations in individual reads can be detected when each raw read isaligned to a contig in a local, contig-specific alignment. Parameters ofthis alignment can be optimized to ensure very highly sensitivedetection of mutations and combinations of mutations. For example,because the position relative to the reference genome is establishedwith confidence, a gap penalty can be decreased relative to a substationprobability. By these methods, the individual reads can be examined todetect any mutations including, for example, mutations specific to thatread or only a few reads. That is, a difference between the referenceand an individual read is determined using differences between the readand the contig plus differences between the contig and the reference.For example, a contig may map to the reference with a 1 bp substitution,and a read may map to the contig with a 8 bp deletion. Methods of theinvention give the correct annotation for the read, relative to thereference, as a description including a 8 bp deletion as well as a 1 bpsubstitution.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of methods of the invention.

FIG. 2 illustrates an exemplary embodiment according to the invention.

FIG. 3 illustrates a system for performing methods of the invention.

DETAILED DESCRIPTION

The invention generally relates to a method for identifying a mutationin a nucleic acid. The invention provides methods for genetic screeningof one or multiple genes or other nucleic acid sequences. Geneticscreening can include screening for a mutation, which can be adisease-associated mutation, where the nucleic acid comes from a sampleobtained from a patient. Methods of the invention can also be used in atest for heterozygosity, or loss thereof, in a tumor.

Nucleic acid in a sample can be any nucleic acid including, for example,genomic DNA in a tissue sample, cDNA amplified from a particular targetin a laboratory sample, or mixed DNA from multiple organisms. In someembodiments, the sample includes homozygous DNA from a haploid ordiploid organism. For example, a sample can include genomic DNA from apatient who is homozygous for a rare recessive allele. In otherembodiments, the sample includes heterozygous genetic material from adiploid organism such that two alleles are present in substantiallyequal amounts. In certain embodiments, the sample includes geneticmaterial from a diploid or polyploid organism with a somatic mutationsuch that two related nucleic acids are present in allele frequenciesother than 50 or 100%, i.e., 20%, 5%, 1%, 0.1%, or any other allelefrequency.

In one embodiment, nucleic acid template molecules (e.g., DNA or RNA)are isolated from a biological sample containing a variety of othercomponents, such as proteins, lipids and non-template nucleic acids.Nucleic acid template molecules can be obtained from any cellularmaterial, obtained from an animal, plant, bacterium, fungus, or anyother cellular organism. Biological samples for use in the presentinvention include viral particles or preparations. Nucleic acid templatemolecules can be obtained directly from an organism or from a biologicalsample obtained from an organism, e.g., from blood, urine, cerebrospinalfluid, seminal fluid, saliva, sputum, stool and tissue. Any tissue orbody fluid specimen may be used as a source for nucleic acid for use inthe invention. Nucleic acid template molecules can also be isolated fromcultured cells, such as a primary cell culture or a cell line. The cellsor tissues from which template nucleic acids are obtained can beinfected with a virus or other intracellular pathogen. A sample can alsobe total RNA extracted from a biological specimen, a cDNA library,viral, or genomic DNA. A sample may also be isolated DNA from anon-cellular origin, e.g. amplified/isolated DNA from the freezer.

Generally, nucleic acid can be extracted from a biological sample by avariety of techniques such as those described by Maniatis, et al.,Molecular Cloning: A Laboratory Manual, 1982, Cold Spring Harbor, N.Y.,pp. 280-281; Sambrook and Russell, Molecular Cloning: A LaboratoryManual 3Ed, Cold Spring Harbor Laboratory Press, 2001, Cold SpringHarbor, N.Y.; or as described in U.S. Pub. 2002/0190663.

Nucleic acid obtained from biological samples may be fragmented toproduce suitable fragments for analysis. Template nucleic acids may befragmented or sheared to desired length, using a variety of mechanical,chemical and/or enzymatic methods. DNA may be randomly sheared viasonication, e.g. Covaris method, brief exposure to a DNase, or using amixture of one or more restriction enzymes, or a transposase or nickingenzyme. RNA may be fragmented by brief exposure to an RNase, heat plusmagnesium, or by shearing. The RNA may be converted to cDNA. Iffragmentation is employed, the RNA may be converted to cDNA before orafter fragmentation. In one embodiment, nucleic acid from a biologicalsample is fragmented by sonication. In another embodiment, nucleic acidis fragmented by a hydroshear instrument. Generally, individual nucleicacid template molecules can be from about 2 kb bases to about 40 kb. Ina particular embodiment, nucleic acids are about 6 kb-10 kb fragments.Nucleic acid molecules may be single-stranded, double-stranded, ordouble-stranded with single-stranded regions (for example, stem- andloop-structures).

A biological sample as described herein may be homogenized orfractionated in the presence of a detergent or surfactant. Theconcentration of the detergent in the buffer may be about 0.05% to about10.0%. The concentration of the detergent can be up to an amount wherethe detergent remains soluble in the solution. In one embodiment, theconcentration of the detergent is between 0.1% to about 2%. Thedetergent, particularly a mild one that is nondenaturing, can act tosolubilize the sample. Detergents may be ionic or nonionic. Examples ofnonionic detergents include triton, such as the Triton® X series(Triton® X-100 t-Oct-C₆H₄—(OCH₂—CH₂)_(x)OH, x=9-10, Triton® X-100R,Triton® X-114 x=7-8), octyl glucoside, polyoxyethylene(9)dodecyl ether,digitonin, IGEPAL® CA630 octylphenyl polyethylene glycol,n-octyl-beta-D-glucopyranoside (betaOG), n-dodecyl-beta, Tween® 20polyethylene glycol sorbitan monolaurate, Tween® 80 polyethylene glycolsorbitan monooleate, polidocanol, n-dodecyl beta-D-maltoside (DDM),NP-40 nonylphenyl polyethylene glycol, C12E8 (octaethylene glycoln-dodecyl monoether), hexaethyleneglycol mono-n-tetradecyl ether(C14EO6), octyl-beta-thioglucopyranoside (octyl thioglucoside, OTG),Emulgen, and polyoxyethylene 10 lauryl ether (C12E10). Examples of ionicdetergents (anionic or cationic) include deoxycholate, sodium dodecylsulfate (SDS), N-lauroylsarcosine, and cetyltrimethylammoniumbromide(CTAB). A zwitterionic reagent may also be used in the purificationschemes of the present invention, such as Chaps, zwitterion 3-14, and3-[(3-cholamidopropyl) dimethyl-ammonio]-1-propanesulfonate. It iscontemplated also that urea may be added with or without anotherdetergent or surfactant.

Lysis or homogenization solutions may further contain other agents, suchas reducing agents. Examples of such reducing agents includedithiothreitol (DTT), β-mercaptoethanol, DTE, GSH, cysteine, cysteamine,tricarboxyethyl phosphine (TCEP), or salts of sulfurous acid.

In various embodiments, the nucleic acid is amplified, for example, fromthe sample or after isolation from the sample. Amplification refers toproduction of additional copies of a nucleic acid sequence and isgenerally carried out using polymerase chain reaction or othertechnologies well known in the art (e.g., Dieffenbach and Dveksler, PCRPrimer, a Laboratory Manual, 1995, Cold Spring Harbor Press, Plainview,N.Y.). The amplification reaction may be any amplification reactionknown in the art that amplifies nucleic acid molecules, such aspolymerase chain reaction, nested polymerase chain reaction, polymerasechain reaction-single strand conformation polymorphism, ligase chainreaction (Barany, F., Genome Research, 1:5-16 (1991); Barany, F., PNAS,88:189-193 (1991); U.S. Pat. No. 5,869,252; and U.S. Pat. No.6,100,099), strand displacement amplification and restriction fragmentslength polymorphism, transcription based amplification system, rollingcircle amplification, and hyper-branched rolling circle amplification.Further examples of amplification techniques that can be used include,but are not limited to, quantitative PCR, quantitative fluorescent PCR(QF-PCR), multiplex fluorescent PCR (MF-PCR), real time PCR (RTPCR),single cell PCR, restriction fragment length polymorphism PCR(PCR-RFLP), RT-PCR-RFLP, hot start PCR, in situ polonony PCR, in siturolling circle amplification (RCA), bridge PCR, picotiter PCR andemulsion PCR. Other suitable amplification methods include transcriptionamplification, self-sustained sequence replication, selectiveamplification of target polynucleotide sequences, consensus sequenceprimed polymerase chain reaction (CP-PCR), arbitrarily primed polymerasechain reaction (AP-PCR), degenerate oligonucleotide-primed PCR (DOP-PCR)and nucleic acid based sequence amplification (NABSA). Otheramplification methods that can be used herein include those described inU.S. Pats. 5,242,794; 5,494,810; 4,988,617; and 6,582,938.

In certain embodiments, the amplification reaction is the polymerasechain reaction. Polymerase chain reaction (PCR) refers to methods by K.B. Mullis (U.S. Pats. 4,683,195 and 4,683,202, hereby incorporated byreference) for increasing concentration of a segment of a targetsequence in a mixture of genomic DNA without cloning or purification.

Primers can be prepared by a variety of methods including but notlimited to cloning of appropriate sequences and direct chemicalsynthesis using methods well known in the art (Narang et al., MethodsEnzymol., 68:90 (1979); Brown et al., Methods Enzymol., 68:109 (1979)).Primers can also be obtained from commercial sources such as OperonTechnologies, Amersham Pharmacia Biotech, Sigma, and Life Technologies.The primers can have an identical melting temperature. The lengths ofthe primers can be extended or shortened at the 5′ end or the 3′ end toproduce primers with desired melting temperatures. Also, the annealingposition of each primer pair can be designed such that the sequence andlength of the primer pairs yield the desired melting temperature. Thesimplest equation for determining the melting temperature of primerssmaller than 25 base pairs is the Wallace Rule (Td=2(A+T)+4(G+C)).Computer programs can also be used to design primers, including but notlimited to Array Designer Software from Arrayit Corporation (Sunnyvale,Calif.), Oligonucleotide Probe Sequence Design Software for GeneticAnalysis from Olympus Optical Co., Ltd. (Tokyo, Japan), NetPrimer, andDNAsis Max v3.0 from Hitachi Solutions America, Ltd. (South SanFrancisco, Calif.). The TM (melting or annealing temperature) of eachprimer is calculated using software programs such as OligoAnalyzer 3.1,available on the web site of Integrated DNA Technologies, Inc.(Coralville, Iowa).

With PCR, it is possible to amplify a single copy of a specific targetsequence in genomic DNA to a level that can be detected by severaldifferent methodologies (e.g., staining; hybridization with a labeledprobe; incorporation of biotinylated primers followed by avidin-enzymeconjugate detection; or incorporation of 32P-labeled deoxynucleotidetriphosphates, such as dCTP or dATP, into the amplified segment). Inaddition to genomic DNA, any oligonucleotide sequence can be amplifiedwith the appropriate set of primer molecules. In particular, theamplified segments created by the PCR process itself are, themselves,efficient templates for subsequent PCR amplifications.

Amplification adapters may be attached to the fragmented nucleic acid.Adapters may be commercially obtained, such as from Integrated DNATechnologies (Coralville, Iowa). In certain embodiments, the adaptersequences are attached to the template nucleic acid molecule with anenzyme. The enzyme may be a ligase or a polymerase. The ligase may beany enzyme capable of ligating an oligonucleotide (RNA or DNA) to thetemplate nucleic acid molecule. Suitable ligases include T4 DNA ligaseand T4 RNA ligase, available commercially from New England Biolabs(Ipswich, Mass.). Methods for using ligases are well known in the art.The polymerase may be any enzyme capable of adding nucleotides to the 3′and the 5′ terminus of template nucleic acid molecules.

The ligation may be blunt ended or via use of complementary overhangingends. In certain embodiments, following fragmentation, the ends of thefragments may be repaired, trimmed (e.g. using an exonuclease), orfilled (e.g., using a polymerase and dNTPs) to form blunt ends. In someembodiments, end repair is performed to generate blunt end 5′phosphorylated nucleic acid ends using commercial kits, such as thoseavailable from Epicentre Biotechnologies (Madison, Wis.). Upongenerating blunt ends, the ends may be treated with a polymerase anddATP to form a template independent addition to the 3′-end and the5′-end of the fragments, thus producing a single A overhanging. Thissingle A is used to guide ligation of fragments with a single Toverhanging from the 5′-end in a method referred to as T-A cloning.

Alternatively, because the possible combination of overhangs left by therestriction enzymes are known after a restriction digestion, the endsmay be left as-is, i.e., ragged ends. In certain embodiments doublestranded oligonucleotides with complementary overhanging ends are used.

In certain embodiments, a single bar code is attached to each fragment.In other embodiments, a plurality of bar codes, e.g., two bar codes, areattached to each fragment.

A bar code sequence generally includes certain features that make thesequence useful in sequencing reactions. For example the bar codesequences are designed to have minimal or no homopolymer regions, i.e.,2 or more of the same base in a row such as AA or CCC, within the barcode sequence. The bar code sequences are also designed so that they areat least one edit distance away from the base addition order whenperforming base-by-base sequencing, ensuring that the first and lastbase do not match the expected bases of the sequence.

The bar code sequences are designed such that each sequence iscorrelated to a particular portion of nucleic acid, allowing sequencereads to be correlated back to the portion from which they came. Methodsof designing sets of bar code sequences is shown for example in U.S.Pat. No. 6,235,475, the contents of which are incorporated by referenceherein in their entirety. In certain embodiments, the bar code sequencesrange from about 5 nucleotides to about 15 nucleotides. In a particularembodiment, the bar code sequences range from about 4 nucleotides toabout 7 nucleotides. Since the bar code sequence is sequenced along withthe template nucleic acid, the oligonucleotide length should be ofminimal length so as to permit the longest read from the templatenucleic acid attached. Generally, the bar code sequences are spaced fromthe template nucleic acid molecule by at least one base (minimizeshomopolymeric combinations).

Methods of the invention involve attaching the bar code sequences to thetemplate nucleic acids. In certain embodiments, the bar code sequencesare attached to the template nucleic acid molecule with an enzyme. Theenzyme may be a ligase or a polymerase, as discussed above. Attachingbar code sequences to nucleic acid templates is shown in U.S. Pub.2008/0081330 and U.S. Pub. 2011/0301042, the content of each of which isincorporated by reference herein in its entirety. Methods for designingsets of bar code sequences and other methods for attaching bar codesequences are shown in U.S. Pats. 6,138,077; 6,352,828; 5,636,400;6,172,214; 6235,475; 7,393,665; 7,544,473; 5,846,719; 5,695,934;5,604,097; 6,150,516; RE39,793; 7,537,897; 6,172,218; and 5,863,722, thecontent of each of which is incorporated by reference herein in itsentirety.

After any processing steps (e.g., obtaining, isolating, fragmenting, oramplification), nucleic acid can be sequenced according to certainembodiments of the invention.

Sequencing may be by any method known in the art. DNA sequencingtechniques include classic dideoxy sequencing reactions (Sanger method)using labeled terminators or primers and gel separation in slab orcapillary, sequencing by synthesis using reversibly terminated labelednucleotides, pyrosequencing, 454 sequencing, Illumina/Solexa sequencing,allele specific hybridization to a library of labeled oligonucleotideprobes, sequencing by synthesis using allele specific hybridization to alibrary of labeled clones that is followed by ligation, real timemonitoring of the incorporation of labeled nucleotides during apolymerization step, polony sequencing, and SOLiD sequencing. Sequencingof separated molecules has more recently been demonstrated by sequentialor single extension reactions using polymerases or ligases as well as bysingle or sequential differential hybridizations with libraries ofprobes.

A sequencing technique that can be used in the methods of the providedinvention includes, for example, 454 sequencing (454 Life Sciences, aRoche company, Branford, Conn.) (Margulies, M et al., Nature,437:376-380 (2005); U.S. Pat. No. 5,583,024; U.S. Pat. No. 5,674,713;and U.S. Pat. No. 5,700,673). 454 sequencing involves two steps. In thefirst step, DNA is sheared into fragments of approximately 300-800 basepairs, and the fragments are blunt ended. Oligonucleotide adaptors arethen ligated to the ends of the fragments. The adaptors serve as primersfor amplification and sequencing of the fragments. The fragments can beattached to DNA capture beads, e.g., streptavidin-coated beads using,e.g., Adaptor B, which contains 5′-biotin tag. The fragments attached tothe beads are PCR amplified within droplets of an oil-water emulsion.The result is multiple copies of clonally amplified DNA fragments oneach bead. In the second step, the beads are captured in wells(pico-liter sized). Pyrosequencing is performed on each DNA fragment inparallel. Addition of one or more nucleotides generates a light signalthat is recorded by a CCD camera in a sequencing instrument. The signalstrength is proportional to the number of nucleotides incorporated.Pyrosequencing makes use of pyrophosphate (PPi) which is released uponnucleotide addition. PPi is converted to ATP by ATP sulfurylase in thepresence of adenosine 5′ phosphosulfate. Luciferase uses ATP to convertluciferin to oxyluciferin, and this reaction generates light that isdetected and analyzed.

Another example of a DNA sequencing technique that can be used in themethods of the provided invention is SOLiD technology by AppliedBiosystems from Life Technologies Corporation (Carlsbad, Calif.). InSOLiD sequencing, genomic DNA is sheared into fragments, and adaptorsare attached to the 5′ and 3′ ends of the fragments to generate afragment library. Alternatively, internal adaptors can be introduced byligating adaptors to the 5′ and 3′ ends of the fragments, circularizingthe fragments, digesting the circularized fragment to generate aninternal adaptor, and attaching adaptors to the 5′ and 3′ ends of theresulting fragments to generate a mate-paired library. Next, clonal beadpopulations are prepared in microreactors containing beads, primers,template, and PCR components. Following PCR, the templates are denaturedand beads are enriched to separate the beads with extended templates.Templates on the selected beads are subjected to a 3′ modification thatpermits bonding to a glass slide. The sequence can be determined bysequential hybridization and ligation of partially randomoligonucleotides with a central determined base (or pair of bases) thatis identified by a specific fluorophore. After a color is recorded, theligated oligonucleotide is cleaved and removed and the process is thenrepeated.

Another example of a DNA sequencing technique that can be used in themethods of the provided invention is Ion Torrent sequencing, described,for example, in U.S. Pubs. 2009/0026082, 2009/0127589, 2010/0035252,2010/0137143, 2010/0188073, 2010/0197507, 2010/0282617, 2010/0300559,2010/0300895, 2010/0301398, and 2010/0304982, the content of each ofwhich is incorporated by reference herein in its entirety. In IonTorrent sequencing, DNA is sheared into fragments of approximately300-800 base pairs, and the fragments are blunt ended. Oligonucleotideadaptors are then ligated to the ends of the fragments. The adaptorsserve as primers for amplification and sequencing of the fragments. Thefragments can be attached to a surface and are attached at a resolutionsuch that the fragments are individually resolvable. Addition of one ormore nucleotides releases a proton (H⁺), which signal is detected andrecorded in a sequencing instrument. The signal strength is proportionalto the number of nucleotides incorporated.

Another example of a sequencing technology that can be used in themethods of the provided invention is Illumina sequencing. Illuminasequencing is based on the amplification of DNA on a solid surface usingfold-back PCR and anchored primers. Genomic DNA is fragmented, andadapters are added to the 5′ and 3′ ends of the fragments. DNA fragmentsthat are attached to the surface of flow cell channels are extended andbridge amplified. The fragments become double stranded, and the doublestranded molecules are denatured. Multiple cycles of the solid-phaseamplification followed by denaturation can create several millionclusters of approximately 1,000 copies of single-stranded DNA moleculesof the same template in each channel of the flow cell. Primers, DNApolymerase and four fluorophore-labeled, reversibly terminatingnucleotides are used to perform sequential sequencing. After nucleotideincorporation, a laser is used to excite the fluorophores, and an imageis captured and the identity of the first base is recorded. The 3′terminators and fluorophores from each incorporated base are removed andthe incorporation, detection and identification steps are repeated.Sequencing according to this technology is described in U.S. Pub.2011/0009278, U.S. Pub. 2007/0114362, U.S. Pub. 2006/0024681, U.S. Pub.2006/0292611, U.S. Pat. No. 7,960,120, U.S. Pat. No. 7,835,871, U.S.Pat. No. 7,232,656, U.S. Pat. No. 7,598,035, U.S. Pat. No. 6,306,597,U.S. Pat. No. 6,210,891, U.S. Pat. No. 6,828,100, U.S. Pat. No.6,833,246, and U.S. Pat. No. 6,911,345, each of which are hereinincorporated by reference in their entirety.

Another example of a sequencing technology that can be used in themethods of the provided invention includes the single molecule,real-time (SMRT) technology of Pacific Biosciences (Menlo Park, Calif.).In SMRT, each of the four DNA bases is attached to one of four differentfluorescent dyes. These dyes are phospholinked. A single DNA polymeraseis immobilized with a single molecule of template single stranded DNA atthe bottom of a zero-mode waveguide (ZMW). A ZMW is a confinementstructure which enables observation of incorporation of a singlenucleotide by DNA polymerase against the background of fluorescentnucleotides that rapidly diffuse in and out of the ZMW (inmicroseconds). It takes several milliseconds to incorporate a nucleotideinto a growing strand. During this time, the fluorescent label isexcited and produces a fluorescent signal, and the fluorescent tag iscleaved off. Detection of the corresponding fluorescence of the dyeindicates which base was incorporated. The process is repeated.

Another example of a sequencing technique that can be used in themethods of the provided invention is nanopore sequencing (Soni, G. V.,and Meller, A., Clin Chem 53: 1996-2001 (2007)). A nanopore is a smallhole, of the order of 1 nanometer in diameter. Immersion of a nanoporein a conducting fluid and application of a potential across it resultsin a slight electrical current due to conduction of ions through thenanopore. The amount of current which flows is sensitive to the size ofthe nanopore. As a DNA molecule passes through a nanopore, eachnucleotide on the DNA molecule obstructs the nanopore to a differentdegree. Thus, the change in the current passing through the nanopore asthe DNA molecule passes through the nanopore represents a reading of theDNA sequence.

Another example of a sequencing technique that can be used in themethods of the provided invention involves using a chemical-sensitivefield effect transistor (chemFET) array to sequence DNA (for example, asdescribed in U.S. Pub. 2009/0026082). In one example of the technique,DNA molecules can be placed into reaction chambers, and the templatemolecules can be hybridized to a sequencing primer bound to apolymerase. Incorporation of one or more triphosphates into a newnucleic acid strand at the 3′ end of the sequencing primer can bedetected by a change in current by a chemFET. An array can have multiplechemFET sensors. In another example, single nucleic acids can beattached to beads, and the nucleic acids can be amplified on the bead,and the individual beads can be transferred to individual reactionchambers on a chemFET array, with each chamber having a chemFET sensor,and the nucleic acids can be sequenced.

Another example of a sequencing technique that can be used in themethods of the provided invention involves using a electron microscope(Moudrianakis E. N. and Beer M., PNAS, 53:564-71 (1965)). In one exampleof the technique, individual DNA molecules are labeled using metalliclabels that are distinguishable using an electron microscope. Thesemolecules are then stretched on a flat surface and imaged using anelectron microscope to measure sequences.

Sequencing according to embodiments of the invention generates aplurality of reads. Reads according to the invention generally includesequences of nucleotide data less than about 150 bases in length, orless than about 90 bases in length. In certain embodiments, reads arebetween about 80 and about 90 bases, e.g., about 85 bases in length. Insome embodiments, methods of the invention are applied to very shortreads, i.e., less than about 50 or about 30 bases in length. Afterobtaining sequence reads, they are further processed as diagrammed inFIG. 1.

FIG. 1 is a diagram of methods of the invention. Methods includeobtaining 101 sequence reads and assembling 105 them into a contig,which is then aligned 109 to a reference. Differences are identified bycomparison 113. The raw reads are aligned 117 to the contigs andpositional and variant information is mapped to the reads from thereference via the contig, allowing genotyping 121 to be performed.

FIG. 2 illustrates an exemplary embodiment according to methods of theinvention. In step 1, reads are assembled into contigs. In the case of asingle, homozygous target nucleic acid, a single contig per location istypically produced. Assembly can include any method including thosediscussed herein.

In step 2 shown in FIG. 2, each contig is aligned to a reference genome.Alignment can be by any method such as those discussed below, including,e.g., the bwa-sw algorithm implemented by BWA. As shown in FIG. 2, bothalign to the same reference position. Differences between the contig andthe reference genome are identified and, as shown in FIG. 2, describedby a CIGAR string.

In step 3, raw reads are aligned to contigs (using any method such as,for example, BWA with bwa-short and writing, for example, a CIGARstring).

As shown in FIG. 2, step 4, raw read alignments are mapped from contigspace to original reference space (e.g., via position and CIGARinformation).

In step 5, genotyping is performed using the translated, aligned readsfrom step 4 (e.g., including raw quality scores for substitutions).

A contig, generally, refers to the relationship between or among aplurality of segments of nucleic acid sequences, e.g., reads. Wheresequence reads overlap, a contig can be represented as a layered imageof overlapping reads. A contig is not defined by, nor limited to, anyparticular visual arrangement nor any particular arrangement within, forexample, a text file or a database. A contig generally includes sequencedata from a number of reads organized to correspond to a portion of asequenced nucleic acid. A contig can include assembly results—such as aset of reads or information about their positions relative to each otheror to a reference—displayed or stored. A contig can be structured as agrid, in which rows are individual sequence reads and columns includethe base of each read that is presumed to align to that site. Aconsensus sequence can be made by identifying the predominant base ineach column of the assembly. A contig according to the invention caninclude the visual display of reads showing them overlap (or not, e.g.,simply abutting) one another. A contig can include a set of coordinatesassociated with a plurality of reads and giving the position of thereads relative to each other. A contig can include data obtained bytransforming the sequence data of reads. For example, a Burrows-Wheelertransformation can be performed on the reads, and a contig can includethe transformed data without necessarily including the untransformedsequences of the reads. A Burrows-Wheeler transform of nucleotidesequence data is described in U.S. Pub. 2005/0032095, hereinincorporated by reference in its entirety.

Reads can be assembled into contigs by any method known in the art.Algorithms for the de novo assembly of a plurality of sequence reads areknown in the art. One algorithm for assembling sequence reads is knownas overlap consensus assembly. Overlap consensus assembly uses theoverlap between sequence reads to create a link between them. The readsare generally linked by regions that overlap enough that non-randomoverlap is assumed. Linking together reads in this way produces a contigor an overlap graph in which each node corresponds to a read and an edgerepresents an overlap between two reads. Assembly with overlap graphs isdescribed, for example, in U.S. Pat. No. 6,714,874.

In some embodiments, de novo assembly proceeds according to so-calledgreedy algorithms. For assembly according to greedy algorithms, one ofthe reads of a group of reads is selected, and it is paired with anotherread with which it exhibits a substantial amount of overlap—generally itis paired with the read with which it exhibits the most overlap of allof the other reads. Those two reads are merged to form a new readsequence, which is then put back in the group of reads and the processis repeated. Assembly according to a greedy algorithm is described, forexample, in Schatz, et al., Genome Res., 20:1165-1173 (2010) and U.S.Pub. 2011/0257889, each of which is hereby incorporated by reference inits entirety.

In other embodiments, assembly proceeds by pairwise alignment, forexample, exhaustive or heuristic (e.g., not exhaustive) pairwisealignment. Alignment, generally, is discussed in more detail below.Exhaustive pairwise alignment, sometimes called a “brute force”approach, calculates an alignment score for every possible alignmentbetween every possible pair of sequences among a set. Assembly byheuristic multiple sequence alignment ignores certain mathematicallyunlikely combinations and can be computationally faster. One heuristicmethod of assembly by multiple sequence alignment is the so-called“divide-and-conquer” heuristic, which is described, for example, in U.S.Pub. 2003/0224384. Another heuristic method of assembly by multiplesequence alignment is progressive alignment, as implemented by theprogram ClustalW (see, e.g., Thompson, et al., Nucl. Acids. Res.,22:4673-80 (1994)). Assembly by multiple sequence alignment in generalis discussed in Lecompte, O., et al., Gene 270:17-30 (2001); Mullan, L.J., Brief Bioinform., 3:303-5 (2002); Nicholas, H. B. Jr., et al.,Biotechniques 32:572-91 (2002); and Xiong, G., Essential Bioinformatics,2006, Cambridge University Press, New York, N.Y.

Assembly by alignment can proceed by aligning reads to each other or byaligning reads to a reference. For example, by aligning each read, inturn, to a reference genome, all of the reads are positioned inrelationship to each other to create the assembly.

One method of assembling reads into contigs involves making a de Bruijngraph. De Bruijn graphs reduce the computation effort by breaking readsinto smaller sequences of DNA, called k-mers, where the parameter kdenotes the length in bases of these sequences. In a de Bruijn graph,all reads are broken into k-mers (all subsequences of length k withinthe reads) and a path between the k-mers is calculated. In assemblyaccording to this method, the reads are represented as a path throughthe k-mers. The de Bruijn graph captures overlaps of length k−1 betweenthese k-mers and not between the actual reads. Thus, for example, thesequencing CATGGA could be represented as a path through the following2-mers: CA, AT, TG, GG, and GA. The de Bruijn graph approach handlesredundancy well and makes the computation of complex paths tractable. Byreducing the entire data set down to k-mer overlaps, the de Bruijn graphreduces the high redundancy in short-read data sets. The maximumefficient k-mer size for a particular assembly is determined by the readlength as well as the error rate. The value of the parameter k hassignificant influence on the quality of the assembly. Estimates of goodvalues can be made before the assembly, or the optimal value can befound by testing a small range of values. Assembly of reads using deBruijn graphs is described in U.S. Pub. 2011/0004413, U.S. Pub.2011/0015863, and U.S. Pub. 2010/0063742, each of which are hereinincorporated by reference in their entirety.

Other methods of assembling reads into contigs according to theinvention are possible. For example, the reads may contain barcodeinformation inserted into template nucleic acid during sequencing. Incertain embodiments, reads are assembled into contigs by reference tothe barcode information. For example, the barcodes can be identified andthe reads can be assembled by positioning the barcodes together.

In certain embodiments, assembly proceeds by making reference tosupplied information about the expected position of the various readsrelative to each other. This can be obtained, for example, if thesubject nucleic acid being sequenced has been captured by molecularinversion probes, because the start of each read derives from a genomicposition that is known and specified by the probe set design. Each readcan be collected according to the probe from which it was designed andpositioned according to its known relative offset. In some embodiments,information about the expected position of reads relative to each otheris supplied by knowledge of the positions (e.g., within a gene) of anarea of nucleic acid amplified by primers. For example, sequencing canbe done on amplification product after a number of regions of the targetnucleic acid are amplified using primer pairs designed or known to coverthose regions. Reads can then be positioned during assembly at leastbased on which primer pair was used in an amplification that lead tothose reads. Assembly of reads into contigs can proceed by anycombination or hybrid of methods including, but not limited to, theabove-referenced methods.

Assembly of reads into contigs is further discussed in Husemann, P. andStoye, J, Phylogenetic Comparative Assembly, 2009, Algorithms inBioinformatics: 9th International Workshop, pp. 145-156, Salzberg, S.,and Warnow, T., Eds. Springer-Verlag, Berlin Heidelberg. Some exemplarymethods for assembling reads into contigs are described, for example, inU.S. Pat. No. 6,223,128, U.S. Pub. 2009/0298064, U.S. Pub. 2010/0069263,and U.S. Pub. 2011/0257889, each of which is incorporated by referenceherein in its entirety.

Computer programs for assembling reads are known in the art. Suchassembly programs can run on a single general-purpose computer, on acluster or network of computers, or on a specialized computing devicesdedicated to sequence analysis.

Assembly can be implemented, for example, by the program ‘The ShortSequence Assembly by k-mer search and 3′ read Extension’ (SSAKE), fromCanada's Michael Smith Genome Sciences Centre (Vancouver, B.C., CA)(see, e.g., Warren, R., et al., Bioinformatics, 23:500-501 (2007)).SSAKE cycles through a table of reads and searches a prefix tree for thelongest possible overlap between any two sequences. SSAKE clusters readsinto contigs.

Another read assembly program is Forge Genome Assembler, written byDarren Platt and Dirk Evers and available through the SourceForge website maintained by Geeknet (Fairfax, Va.) (see, e.g., DiGuistini, S., etal., Genome Biology, 10:R94 (2009)). Forge distributes its computationaland memory consumption to multiple nodes, if available, and hastherefore the potential to assemble large sets of reads. Forge waswritten in C++ using the parallel MPI library. Forge can handle mixturesof reads, e.g., Sanger, 454, and Illumina reads.

Assembly through multiple sequence alignment can be performed, forexample, by the program Clustal Omega, (Sievers F., et al., Mol SystBiol 7 (2011)), ClustalW, or ClustalX (Larkin M. A., et al.,Bioinformatics, 23, 2947-2948 (2007)) available from University CollegeDublin (Dublin, Ireland).

Another exemplary read assembly program known in the art is Velvet,available through the web site of the European Bioinformatics Institute(Hinxton, UK) (Zerbino D. R. et al., Genome Research 18(5):821-829(2008)). Velvet implements an approach based on de Bruijn graphs, usesinformation from read pairs, and implements various error correctionsteps.

Read assembly can be performed with the programs from the package SOAP,available through the website of Beijing Genomics Institute (Beijing,Conn.) or BGI Americas Corporation (Cambridge, Mass.). For example, theSOAPdenovo program implements a de Bruijn graph approach. SOAP3/GPUaligns short reads to a reference sequence.

Another read assembly program is ABySS, from Canada's Michael SmithGenome Sciences Centre (Vancouver, B.C., CA) (Simpson, J. T., et al.,Genome Res., 19(6):1117-23 (2009)). ABySS uses the de Bruijn graphapproach and runs in a parallel environment.

Read assembly can also be done by Roche's GS De Novo Assembler, known asgsAssembler or Newbler (NEW assemBLER), which is designed to assemblereads from the Roche 454 sequencer (described, e.g., in Kumar, S. etal., Genomics 11:571 (2010) and Margulies, et al., Nature 437:376-380(2005)). Newbler accepts 454 Flx Standard reads and 454 Titanium readsas well as single and paired-end reads and optionally Sanger reads.Newbler is run on Linux, in either 32 bit or 64 bit versions. Newblercan be accessed via a command-line or a Java-based GUI interface.

Cortex, created by Mario Caccamo and Zamin Iqbal at the University ofOxford, is a software framework for genome analysis, including readassembly. Cortex includes cortex_con for consensus genome assembly, usedas described in Spanu, P. D., et al., Science 330(6010):1543-46 (2010).Cortex includes cortex_var for variation and population assembly,described in Iqbal, et al., De novo assembly and genotyping of variantsusing colored de Bruijn graphs, Nature Genetics (in press), and used asdescribed in Mills, R. E., et al., Nature 470:59-65 (2010). Cortex isavailable through the creators' web site and from the SourceForge website maintained by Geeknet (Fairfax, Va.).

Other read assembly programs include RTG Investigator from Real TimeGenomics, Inc. (San Francisco, Calif.); iAssembler (Zheng, et al., BMCBioinformatics 12:453 (2011)); TgiCL Assembler (Pertea, et al.,Bioinformatics 19(5):651-52 (2003)); Maq (Mapping and Assembly withQualities) by Heng Li, available for download through the SourceForgewebsite maintained by Geeknet (Fairfax, Va.); MIRA3 (MimickingIntelligent Read Assembly), described in Chevreux, B., et al., GenomeSequence Assembly Using Trace Signals and Additional SequenceInformation, 1999, Computer Science and Biology: Proceedings of theGerman Conference on Bioinformatics (GCB) 99:45-56; PGA4genomics(described in Zhao F., et al., Genomics. 94(4):284-6 (2009)); and Phrap(described, e.g., in de la Bastide, M. and McCombie, W. R., CurrentProtocols in Bioinformatics, 17:11.4.1-11.4.15 (2007)). CLC cell is a deBruijn graph-based computer program for read mapping and de novoassembly of NGS reads available from CLC bio Germany (Muehltal,Germany).

Assembly of reads produces one or more contigs. In the case of ahomozygous or single target sequencing, a single contig will beproduced. In the case of a heterozygous diploid target, a rare somaticmutation, or a mixed sample, for example, two or more contigs can beproduced. Each contig includes information from the reads that make upthat contig.

Assembling the reads into contigs is conducive to producing a consensussequence corresponding to each contig. In certain embodiments, aconsensus sequence refers to the most common, or predominant, nucleotideat each position from among the assembled reads. A consensus sequencecan represent an interpretation of the sequence of the nucleic acidrepresented by that contig.

The invention provides methods for positioning a consensus sequence or acontig along a reference genome. In certain embodiments, a consensussequence or a contig is positioned on a reference through informationfrom known molecular markers or probes. In some embodiments,protein-coding sequence data in a contig, consensus sequence, orreference genome is represented by amino acid sequence and a contig ispositioned along a reference genome. In some embodiments, a consensussequence or a contig is positioned by an alignment of the contig to areference genome.

Alignment, as used herein, generally involves placing one sequence alonganother sequence, iteratively introducing gaps along each sequence,scoring how well the two sequences match, and preferably repeating forvarious position along the reference. The best-scoring match is deemedto be the alignment and represents an inference about the historicalrelationship between the sequences. In an alignment, a base in the readalongside a non-matching base in the reference indicates that asubstitution mutation has occurred at that point. Similarly, where onesequence includes a gap alongside a base in the other sequence, aninsertion or deletion mutation (an “indel”) is inferred to haveoccurred. When it is desired to specify that one sequence is beingaligned to one other, the alignment is sometimes called a pairwisealignment. Multiple sequence alignment generally refers to the alignmentof two or more sequences, including, for example, by a series ofpairwise alignments.

In some embodiments, scoring an alignment involves setting values forthe probabilities of substitutions and indels. When individual bases arealigned, a match or mismatch contributes to the alignment score by asubstitution probability, which could be, for example, 1 for a match and0.33 for a mismatch. An indel deducts from an alignment score by a gappenalty, which could be, for example, −1. Gap penalties and substitutionprobabilities can be based on empirical knowledge or a prioriassumptions about how sequences mutate. Their values affects theresulting alignment. Particularly, the relationship between the gappenalties and substitution probabilities influences whethersubstitutions or indels will be favored in the resulting alignment.

Stated formally, an alignment represents an inferred relationshipbetween two sequences, x and y. For example, in some embodiments, analignment A of sequences x and y maps x and y respectively to anothertwo strings x′ and y′ that may contain spaces such that: (i) |x′|=|y′|;(ii) removing spaces from x′ and y′ should get back x and y,respectively; and (iii) for any i, x′[i] and y′[i] cannot be bothspaces.

A gap is a maximal substring of contiguous spaces in either x′ or y′. Analignment A can include the following three kinds of regions: (i)matched pair (e.g., x′[i]=y′[i]; (ii) mismatched pair, (e.g.,x′[i]≠y′[i] and both are not spaces); or (iii) gap (e.g., either x′[i .. . j] or is a gap). In certain embodiments, only a matched pair has ahigh positive score a. In some embodiments, a mismatched pair generallyhas a negative score b and a gap of length r also has a negative scoreg+rs where g, s<0. For DNA, one common scoring scheme (e.g. used byBLAST) makes score a=1, score b=−3, g=−5 and s=−2. The score of thealignment A is the sum of the scores for all matched pairs, mismatchedpairs and gaps. The alignment score of x and y can be defined as themaximum score among all possible alignments of x and y.

In some embodiments, any pair has a score a defined by a 4×4 matrix B ofsubstitution probabilities. For example, B(i,i)=1 and 0<B(i,j)_(i<>j)<1is one possible scoring system. For instance, where a transition isthought to be more biologically probable than a transversion, matrix Bcould include B(C,T)=0.7 and B(A,T)=0.3, or any other set of valuesdesired or determined by methods known in the art.

Alignment according to some embodiments of the invention includespairwise alignment. A pairwise alignment, generally, involves—forsequence Q (query) having m characters and a reference genome T (target)of n characters—finding and evaluating possible local alignments betweenQ and T. For any 1≦i≦n and 1≦j≦m, the largest possible alignment scoreof T[h . . . i] and Q[k . . . j], where h≦i and k≦j, is computed (i.e.the best alignment score of any substring of T ending at position i andany substring of Q ending at position j). This can include examining allsubstrings with cm characters, where c is a constant depending on asimilarity model, and aligning each substring separately with Q. Eachalignment is scored, and the alignment with the preferred score isaccepted as the alignment. In some embodiments an exhaustive pairwisealignment is performed, which generally includes a pairwise alignment asdescribed above, in which all possible local alignments (optionallysubject to some limiting criteria) between Q and T are scored.

In some embodiments, pairwise alignment proceeds according to dot-matrixmethods, dynamic programming methods, or word methods. Dynamicprogramming methods generally implement the Smith-Waterman (SW)algorithm or the Needleman-Wunsch (NW) algorithm. Alignment according tothe NW algorithm generally scores aligned characters according to asimilarity matrix S(a,b) (e.g., such as the aforementioned matrix B)with a linear gap penalty d. Matrix S(a,b) generally suppliessubstitution probabilities. The SW algorithm is similar to the NWalgorithm, but any negative scoring matrix cells are set to zero. The SWand NW algorithms, and implementations thereof, are described in moredetail in U.S. Pat. No. 5,701,256 and U.S. Pub. 2009/0119313, bothherein incorporated by reference in their entirety. Computer programsknown in the art for implementing these methods are described in moredetail below.

In certain embodiments, an exhaustive pairwise alignment is avoided bypositioning a consensus sequence or a contig along a reference genomethrough the use of a transformation of the sequence data. One usefulcategory of transformation according to some embodiments of theinvention involve making compressed indexes of sequences (see, e.g.,Lam, et al., Compressed indexing and local alignment of DNA, 2008,Bioinformatics 24(6):791-97). Exemplary compressed indexes include theFN-index, the compressed suffix array, and the Burrows-Wheeler Transform(BWT, described in more detail below).

In certain embodiments, the invention provides methods of alignmentwhich avoid an exhaustive pairwise alignment by making a suffix tree(sometime known as a suffix trie). Given a reference genome T, a suffixtree for T is a tree comprising all suffices of T such that each edge isuniquely labeled with a character, and the concatenation of the edgelabels on a path from the root to a leaf corresponds to a unique suffixof T. Each leaf stores the starting location of the correspondingsuffix.

On a suffix tree, distinct substrings of T are represented by differentpaths from the root of the suffix tree. Then, Q is aligned against eachpath from the root up to cm characters (e.g., using dynamicprogramming). The common prefix structure of the paths also gives a wayto share the common parts of the dynamic programming on different paths.A pre-order traversal of the suffix tree is performed; at each node, adynamic programming table (DP table) is maintained for aligning thepattern and the path up to the node. More rows are added to the tablewhile proceeding down the tree, and corresponding rows are deleted whileascending the tree.

In certain embodiments, a BWT is used to index reference T, and theindex is used to emulate a suffix tree. The Burrows—Wheeler transform(BWT) (Burrow and Wheeler, 1994, A block-sorting lossless datacompression algorithm, Technical Report 124, Digital EquipmentCorporation, CA) was invented as a compression technique and laterextended to support pattern matching. To perform a BWT, first let T be astring of length n over an alphabet Σ. Assume that the last character ofT is a unique special character $, which is smaller than any characterin Σ. The suffix array SA[0, n−1] of T is an array of indexes such thatSA[i] stores the starting position of the i-th-lexicographicallysmallest suffix. The BWT of T is a permutation of T such that BWT [i]=T[SA[i]−1]. For example, if T=‘acaacg$’, then SA=(8, 3, 1, 4, 2, 5, 6,7), and BWT=‘gc$aaace’.

Alignment generally involves finding the best alignment score among substrings of T and Q. Using a BWT of T speeds up this step by avoidingaligning substrings of T that are identical. This method exploits thecommon prefix structure of a tree to avoid aligning identical substringsmore than once. Use of a pre-order traversal of the suffix treegenerates all distinct substrings of T. Further, only substrings of T oflength at most cm, where c is usually a constant bounded by 2, areconsidered, because the score of a match is usually smaller than thepenalty due to a mismatch/insert/delete, and a substring of T with morethan 2m characters has at most m matches and an alignment score lessthan 0. Implementation of the method for aligning sequence data isdescribed in more detail in Lam, et al., Bioinformatics 24(6):791-97(2008).

An alignment according to the invention can be performed using anysuitable computer program known in the art.

One exemplary alignment program, which implements a BWT approach, isBurrows-Wheeler Aligner (BWA) available from the SourceForge web sitemaintained by Geeknet (Fairfax, Va.). BWA can align reads, contigs, orconsensus sequences to a reference. BWT occupies 2 bits of memory pernucleotide, making it possible to index nucleotide sequences as long as4G base pairs with a typical desktop or laptop computer. Thepre-processing includes the construction of BWT (i.e., indexing thereference) and the supporting auxiliary data structures.

BWA implements two different algorithms, both based on BWT. Alignment byBWA can proceed using the algorithm bwa-short, designed for shortqueries up to ˜200 bp with low error rate (<3%) (Li H. and Durbin R.Bioinformatics, 25:1754-60 (2009)). The second algorithm, BWA-SW, isdesigned for long reads with more errors (Li H. and Durbin R. (2010)Fast and accurate long-read alignment with Burrows-Wheeler Transform.Bioinformatics, Epub.). The BWA-SW component performs heuristicSmith-Waterman-like alignment to find high-scoring local hits. Oneskilled in the art will recognize that bwa-sw is sometimes referred toas “bwa-long”, “bwa long algorithm”, or similar. Such usage generallyrefers to BWA-SW.

An alignment program that implements a version of the Smith-Watermanalgorithm is MUMmer, available from the SourceForge web site maintainedby Geeknet (Fairfax, Va.). MUMmer is a system for rapidly aligningentire genomes, whether in complete or draft form (Kurtz, S., et al.,Genome Biology, 5:R12 (2004); Delcher, A. L., et al., Nucl. Acids Res.,27:11 (1999)). For example, MUMmer 3.0 can find all 20-basepair orlonger exact matches between a pair of 5-megabase genomes in 13.7seconds, using 78 MB of memory, on a 2.4 GHz Linux desktop computer.MUMmer can also align incomplete genomes; it can easily handle the 100 sor 1000 s of contigs from a shotgun sequencing project, and will alignthem to another set of contigs or a genome using the NUCmer programincluded with the system. If the species are too divergent for a DNAsequence alignment to detect similarity, then the PROmer program cangenerate alignments based upon the six-frame translations of both inputsequences.

Another exemplary alignment program according to embodiments of theinvention is BLAT from Kent Informatics (Santa Cruz, Calif.) (Kent, W.J., Genome Research 4: 656-664 (2002)). BLAT (which is not BLAST) keepsan index of the reference genome in memory such as RAM. The indexincludes of all non-overlapping k-mers (except optionally for thoseheavily involved in repeats), where k=11 by default. The genome itselfis not kept in memory. The index is used to find areas of probablehomology, which are then loaded into memory for a detailed alignment.

Another alignment program is SOAP2, from Beijing Genomics Institute(Beijing, Conn.) or BGI Americas Corporation (Cambridge, Mass.). SOAP2implements a 2-way BWT (Li et al., Bioinformatics 25(15):1966-67 (2009);Li, et al., Bioinformatics 24(5):713-14 (2008)).

Another program for aligning sequences is Bowtie (Langmead, et al.,Genome Biology, 10:R25 (2009)). Bowtie indexes reference genomes bymaking a BWT.

Other exemplary alignment programs include: Efficient Large-ScaleAlignment of Nucleotide Databases (ELAND) or the ELANDv2 component ofthe Consensus Assessment of Sequence and Variation (CASAVA) software(Illumina, San Diego, Calif.); RTG Investigator from Real Time Genomics,Inc. (San Francisco, Calif.); Novoalign from Novocraft (Selangor,Malaysia); Exonerate, European Bioinformatics Institute (Hinxton, UK)(Slater, G., and Birney, E., BMC Bioinformatics 6:31 (2005)), ClustalOmega, from University College Dublin (Dublin, Ireland) (Sievers F., etal., Mol Syst Biol 7, article 539 (2011)); ClustalW or ClustalX fromUniversity College Dublin (Dublin, Ireland) (Larkin M. A., et al.,Bioinformatics, 23, 2947-2948 (2007)); and FASTA, EuropeanBioinformatics Institute (Hinxton, UK) (Pearson W. R., et al., PNAS85(8):2444-8 (1988); Lipman, D. J., Science 227(4693):1435-41 (1985)).

With each contig aligned to the reference genome, any differencesbetween the contig and the reference genome can be identified by acomparison 113 (FIG. 2). For example, a computer alignment program canreport any non-matching pair of aligned bases, e.g., in a visual displayor as a character string. Since the entire contig is aligned to thereference genome, instead of individual short reads, an incorrectposition is improbable. Thus, any indels introduced here typically willrepresent differences between the nucleic acid and the reference.

Parameters of the alignment are optimized to ensure correct placement ofthe contig on the overall reference genome, for example, where thebwa-sw algorithm is used. Each mutation identified here in a contig orconsensus sequence can be described as it exists in the nucleic acidwith reference to the reference genome. For convenience, this could bereferred to as a contig:reference description of a mutation. Identifyinga mutation according to methods of the invention can include identifying1, 2, or any number of mutations as revealed, for example, by comparinga contig or consensus sequence to a reference genome.

In certain embodiments, a computer program creates a display, file, orvariable containing a description of the mutation or descriptions of anumber of mutations. In some embodiments, a contig:reference descriptionof mutations will be written in a file, e.g., a CIGAR string in a BAMformat file.

Besides aligning the contigs to a reference, methods of the inventioninclude analyzing the individual reads. Individual reads are aligned 117(FIG. 2) to the contigs, and positional and variant information ismapped from the reference to the individual reads through the contig.

An alignment according to this step can be performed by a computerprogram using any suitable algorithm such as one of those discussedabove. In certain embodiments, a pair-wise alignment is performedbetween a read and a contig and optionally iterated until each read ofinterest has been aligned. In certain embodiments, the reads are alignedto a contig using bwa-short as implemented by BWA. In some embodiments,only certain reads are aligned—for example, those having certain scoresor those that were excluded from further analysis prior to or duringassembly.

Parameters of this alignment can be optimized to ensure very highlysensitive detection of mutations and combinations of mutations. Forexample, because the position relative to the reference genome isestablished with confidence, a gap penalty can be decreased relative toa substitution probability. In this alignment step by these parameters,long indels, indel-proximal substitutions, and indels near the ends ofreads are detected with fidelity.

Each read represents a portion of the nucleic acid from the sample thatcan be described with reference to the contig. For convenience's sake,this could be referred to as a read:contig description. In certainembodiments, a computer program creates a display, file, or variablecontaining a read:contig description. The read:contig description can bewritten in a file, e.g., a CIGAR string in a BAM format file.

The output of the local alignment, describing the read with regards tothe contig, can be combined with the output of the reference alignment,describing the contig with regards to the reference genome. Thus, theportion of nucleic acid represented by a read is determined to berelated to a corresponding portion of the genome by the combination ofalignments. This can be expressed, for example, by combining aread:contig description with a contig:reference description to produce aread:reference description. In this way, a target can be genotyped 121,as shown in FIG. 2.

This combination gives, for any mutation detected in the nucleic acid, adescription of that mutation by reference to the reference genome. Thatis, the difference relative to the reference for the individual reads isdetermined using alignments between the read and the mapped contig plusalignments between the mapped contig and the reference genome.

In some embodiments, a heterogeneous sample includes some reads thatvary from the contig. For example, a contig may map to reference genomeaccording to a contig:reference alignment with a 1 bp substitution. Aread may map to the contig with a 8 bp deletion according to aread:contig alignment. Methods of the invention give the correctannotation for the read, relative to the reference, as a descriptionincluding a 8 bp deletion and a 1 bp substitution.

In some embodiments, any or all of the steps of the invention areautomated. For example, a Perl script or shell script can be written toinvoke any of the various programs discussed above (see, e.g., Tisdall,Mastering Perl for Bioinformatics, O'Reilly & Associates, Inc.,Sebastopol, Calif. 2003; Michael, R., Mastering Unix Shell Scripting,Wiley Publishing, Inc., Indianapolis, Ind. 2003). Alternatively, methodsof the invention may be embodied wholly or partially in one or morededicated programs, for example, each optionally written in a compiledlanguage such as C++ then compiled and distributed as a binary. Methodsof the invention may be implemented wholly or in part as modules within,or by invoking functionality within, existing sequence analysisplatforms. In certain embodiments, methods of the invention include anumber of steps that are all invoked automatically responsive to asingle starting queue (e.g., one or a combination of triggering eventssourced from human activity, another computer program, or a machine).Thus, the invention provides methods in which any or the steps or anycombination of the steps can occur automatically responsive to a queue.Automatically generally means without intervening human input,influence, or interaction (i.e., responsive only to original orpre-queue human activity). In certain embodiments, a contig is alignedto a reference, a read is aligned to a contig, the two alignments arecombined, and a mutation is identified based on the combination of thealignments, partially or entirely automatically.

By combining alignments according to methods of the invention, thelimitations of a tradeoff between substitution sensitivity and deletionsensitivity are overcome. The output includes an accurate and sensitiveinterpretation of the subject nucleic. The output can be provided in theformat of a computer file. In certain embodiments, the output is a FASTAfile, VCF file, text file, or an XML file containing sequence data suchas a sequence of the nucleic acid aligned to a sequence of the referencegenome. In other embodiments, the output contains coordinates or astring describing one or more mutations in the subject nucleic acidrelative to the reference genome. Alignment strings known in the artinclude Simple UnGapped Alignment Report (SUGAR), Verbose Useful LabeledGapped Alignment Report (VULGAR), and Compact Idiosyncratic GappedAlignment Report (CIGAR) (Ning, Z., et al., Genome Research11(10):1725-9 (2001)). These strings are implemented, for example, inthe Exonerate sequence alignment software from the EuropeanBioinformatics Institute (Hinxton, UK).

In some embodiments, the output is a sequence alignment—such as, forexample, a sequence alignment map (SAM) or binary alignment map (BAM)file—comprising a CIGAR string (the SAM format is described, e.g., inLi, et al., The Sequence Alignment/Map format and SAMtools,Bioinformatics, 2009, 25(16):2078-9). In some embodiments, CIGARdisplays or includes gapped alignments one-per-line. CIGAR is acompressed pairwise alignment format reported as a CIGAR string. A CIGARstring is useful for representing long (e.g. genomic) pairwisealignments. A CIGAR string is used in SAM format to represent alignmentsof reads to a reference genome sequence.

An CIGAR string follows an established motif. Each character is precededby a number, giving the base counts of the event. Characters used caninclude M, I, D, N, and S (M=match; I=insertion; D=deletion; N=gap;S=substitution).

The cigar line defines the sequence of matches/mismatches and deletions(or gaps). For example, the cigar line 2MD3M2D2M will mean that thealignment contains 2 matches, 1 deletion (number 1 is omitted in orderto save some space), 3 matches, 2 deletions and 2 matches.

To illustrate, if the original sequence is AACGCTT and the CIGAR stringis 2MD3M2D2M, the aligned sequence will be AA-CGG-TT. As a furtherexample, if an 80 bp read aligns to a contig such that the first 5′nucleotide of the read aligns to the 50th nucleotide from the 5′ end ofthe contig with no indels or substitutions between the read and thecontig, the alignment will yield “80M” as a CIGAR string.

By combining alignments according to the methods described herein, oneor more mutations in the nucleic acid can be detected, identified, ordescribed. Mutations detectable by methods of the invention includesubstitutions, single nucleotide polymorphisms, insertions, deletions,translocations, epigenetic mutations, and copy number repeats. Methodsof the invention detect and describe more than one mutation including,for example, two or more mutations. Mutations can be detected that arenear each other, for example less than 1,000 nucleotides apart along astrand of nucleic acid. In a preferred embodiment, methods of theinvention can detect two mutations within 100 bases of each other, andeven within 10 bases of each other.

Methods of the invention can detect two mutations of different typesnear each other including, for example, a substitution near an indel.Mutations near the ends of sequence reads can be detected including, forexample, an indel at or near an end of a read.

By detecting mutations with great sensitivity, methods of the inventioncan reliably detect disease-associated mutations. Disease associatedmutations include known mutations and novel ones. In certainembodiments, methods of the invention detect mutations known to beassociated with Tay-Sachs disease, Canavan disease, cystic fibrosis,familial dysautonomia, Bloom syndrome, Fanconi anemia group C, Gaucherdisease, mucolipidosis type IV, Niemann-Pick disease type A, spinalmuscular atrophy (SMA), Sickle cell anemia, Thalassemia, or novelmutations.

Other embodiments are within the scope and spirit of the invention. Forexample, due to the nature of software, functions described above can beimplemented using software, hardware, firmware, hardwiring, orcombinations of any of these. Features implementing functions can alsobe physically located at various positions, including being distributedsuch that portions of functions are implemented at different physicallocations.

As one skilled in the art would recognize as necessary or best-suitedfor performance of the methods of the invention and sequence assembly ingeneral, computer system 200 or machines of the invention include one ormore processors (e.g., a central processing unit (CPU) a graphicsprocessing unit (GPU) or both), a main memory and a static memory, whichcommunicate with each other via a bus.

In an exemplary embodiment shown in FIG. 3, system 200 can include asequencer 201 with data acquisition module 205 to obtain sequence readdata. Sequencer 201 may optionally include or be operably coupled to itsown, e.g., dedicated, sequencer computer 233 (including an input/outputmechanism 237, one or more of processor 241 and memory 245).Additionally or alternatively, sequencer 201 may be operably coupled toa server 213 or computer 249 (e.g., laptop, desktop, or tablet) vianetwork 209. Computer 249 includes one or more processor 259 and memory263 as well as an input/output mechanism 254. Where methods of theinvention employ a client/server architecture, an steps of methods ofthe invention may be performed using server 213, which includes one ormore of processor 221 and memory 229, capable of obtaining data,instructions, etc., or providing results via interface module 225 orproviding results as a file 217. Server 213 may be engaged over network209 through computer 249 or terminal 267, or server 213 may be directlyconnected to terminal 267, including one or more processor 275 andmemory 279, as well as input/output mechanism 271.

System 200 or machines according to the invention may further include,for any of I/O 249, 237, or 271 a video display unit (e.g., a liquidcrystal display (LCD) or a cathode ray tube (CRT)). Computer systems ormachines according to the invention can also include an alphanumericinput device (e.g., a keyboard), a cursor control device (e.g., amouse), a disk drive unit, a signal generation device (e.g., a speaker),a touchscreen, an accelerometer, a microphone, a cellular radiofrequency antenna, and a network interface device, which can be, forexample, a network interface card (NIC), Wi-Fi card, or cellular modem.

Memory 263, 245, 279, or 229 according to the invention can include amachine-readable medium on which is stored one or more sets ofinstructions (e.g., software) embodying any one or more of themethodologies or functions described herein. The software may alsoreside, completely or at least partially, within the main memory and/orwithin the processor during execution thereof by the computer system,the main memory and the processor also constituting machine-readablemedia.

The software may further be transmitted or received over a network viathe network interface device.

While the machine-readable medium can in an exemplary embodiment be asingle medium, the term “machine-readable medium” should be taken toinclude a single medium or multiple media (e.g., a centralized ordistributed database, and/or associated caches and servers) that storethe one or more sets of instructions. The term “machine-readable medium”shall also be taken to include any medium that is capable of storing,encoding or carrying a set of instructions for execution by the machineand that cause the machine to perform any one or more of themethodologies of the present invention. The term “machine-readablemedium” shall accordingly be taken to include, but not be limited to,solid-state memories (e.g., subscriber identity module (SIM) card,secure digital card (SD card), micro SD card, or solid-state drive(SSD)), optical and magnetic media, and any other tangible storagemedia.

INCORPORATION BY REFERENCE

References and citations to other documents, such as patents, patentapplications, patent publications, journals, books, papers, webcontents, have been made throughout this disclosure. All such documentsare hereby incorporated herein by reference in their entirety for allpurposes.

EQUIVALENTS

Various modifications of the invention and many further embodimentsthereof, in addition to those shown and described herein, will becomeapparent to those skilled in the art from the full contents of thisdocument, including references to the scientific and patent literaturecited herein. The subject matter herein contains important information,exemplification and guidance that can be adapted to the practice of thisinvention in its various embodiments and equivalents thereof.

EXAMPLES Example 1 Multiple (N) Reads from Two Alleles

A sample containing heterozygous human genomic nucleic acid is obtained.

A short human exon is captured by three different mutually overlappingmolecular inversion probes. The sample is heterozygous. The nucleic acidis sequenced to produce N different reads, each 80 bp.

All reads derived from one of those probes (i.e., all reads beginningwith the targeting sequence from one of the three probes) are collectedand passed to the assembly algorithm together with their known relativeoffsets, which is known because the reads were sequenced from theprobes.

This “subsetting” allows the efficiency of the assembly to be increasedby seeding the assembly with information about the expected position ofeach given read relative to the others. Thus, the start of each readderives from a genomic position that is known and specified by the probeset design.

The reads are assembled into two contigs. Two contigs are found, eachabout 180 bp. Each contig contains a unique combination of indels andsubstitutions. One contig corresponds to allele 1 and the other contigcorresponds to allele 2.

Each contig is aligned to a reference genome to determine the genomicposition of each contig. Each contig aligns to substantially the samereference position.

For each contig, any difference between that contig and a referencegenome is identified, where differences includes substitutions andindels. This is done using BWA-long, which returns a BAM format filecontaining the alignment of the contig and a CIGAR string describing anydifferences from the reference sequence. One contains a deletion and asubstitution, relative to the reference. The position for the allele 1contig (Contig1) is chromosome 7:11608523 and the CIGAR string for theallele 1 contig is 180M. The position for the allele 2 contig (Contig2)is chromosome 7:11608523 and the CIGAR string for the allele 1 contig is75M8D105M.

The increased sequence context provided by the assembled reads (contigs)improves the overall mapping accuracy. For example, the probes are morelikely to map uniquely to the reference genome as an assembled contigthan as individual reads.

The local, contig-specific alignment of each raw read is determinedusing BWA-short, with appropriate adjustments to substitution and gappenalties to ensure accurate placement of each read to its respectivecontig. That is, each raw read is aligned to the contig with which it isassociated. The positions and CIGAR strings are given in Table 1.

TABLE 1 Position and CIGAR string of each read Read Position CIGARstring 1 Contig1: 1 80M 2 Contig2: 1 80M 3 Contig2: 50 80M 4 Contig2:100 80M 5 Contig1: 50 80M 6 Contig1: 100 80M

For each raw read, any difference between that read and contig to whichit has been aligned is identified (i.e., as represented by the entriesin the CIGAR string column in Table 1). Most differences will representsequencing errors (since most common variations will typically becaptured in the contigs themselves). However, some differences between araw read and a contig will be evidence of rare alleles if the samplebeing analyzed is, e.g., a hetergoenous population of cells or ploidyother than diploid. The output of this alignment is in BAM format.

The alignments of the raw reads to the contigs are converted intoglobal, reference-based alignments. Specifically, the coordinates areconverted to reference coordinates (multiple contigs can map to the samereference position). For each individual read, the differences betweenthat read and the reference are determined using the differences betweenthat read and the mapped contig plus the differences between the mappedcontig and the reference.

TABLE 2 Map of raw read alignments from contig space to originalreference space Read to Read to Position Contig Position of Reference ofRead CIGAR Contig in CIGAR Read in Contig string Reference string 1Contig1: 1 80M Chr7: 11608523 85M 2 Contig2: 1 80M Chr7: 1160852375M8D5M 3 Contig2: 50 80M Chr7: 11608573 25M8D55M 4 Contig2: 100 80MChr7: 11608631 80M 5 Contig1: 50 80M Chr7: 11608573 80M 6 Contig2: 5030MS49M Chr7: 11608573 25M8D5MS49M . . . N Contig1: 100 80M Chr7:11608623 80M

One contig (Contig2) maps to the reference with an 8 bp deletion. Oneindividual read (Read 6) maps to that contig with a 1 bp substitutiononly. The correct annotation for this read, relative to the reference,is 8 bp deletion and 1 bp substitution. The CIGAR string in the BAM fileis adjusted accordingly. The adjusted CIGAR string for Read 6 is25M8D5MS49M, indicating that, beginning at position 11,608,573 ofchromosome 7, Read 6 matches chromosome 7 for 25 bases, then has an 8base deletion followed by 5 matching bases, then 1 substitution,followed by 49 more matching bases. Thus, an indel and a substitutionproximal to each other on the nucleic acid and captured in Read 6 aredescribed as mutations relative to the wild-type reference genome.

The genotype is assessed (i.e., for substitutions or indels) on theoutput BAM file. This allows for the detection of large indels andallows for accurate, quality score-based genotyping of substitutionsnear indels.

Example 2 Assemblies of Subsets

A sample containing a nucleic acid containing three exons is obtained.

Each exon is defined by a set of probe arms that covers it. Each exon iscaptured by the probes and sequenced to produce a plurality of reads,each approximately 85 bp.

All reads are collected and passed to the assembly algorithm and aresubset going into the assembly using the probe arms. Each subsetincludes those reads associated with one set of probe arms. This reducesthe runtime of the assembly drastically, relative to assembly withoutsubsetting the reads, because assembly time is exponential to number ofreads.

Each subset is assembled into a contig. Since each subset corresponds toa set of probe arms, and each exon is defined as the set of probe armsthat covers it, three contigs are found.

Each contig is aligned to a reference using BWA-long and any differencesare found and recorded as a position and a CIGAR string in acontig:reference BAM format file. This alignment step captures andstores variant information about the sample and thus the reads. Thevariant information is transferred through to the individual reads byaligning them to the contigs to produce a read:contig BAM file andmapping the first alignment onto the second. The individual reads arethus translated to include the variant and positional information.

Each exon in the nucleic acid from the sample is then genotyped based onthe translated reads.

1. A method for identifying a mutation in a nucleic acid, the methodcomprising: sequencing nucleic acid to generate a plurality of sequencereads; creating a contig based on the reads; aligning the contig to areference sequence; aligning the individual reads back to the contig;and identifying a mutation based on the alignments to the contig and thereference sequence. 2-12. (canceled)
 13. A system for identifying amutation in a nucleic acid, the system comprising: a computing deviceincluding a tangible, non-transitory memory coupled to a processorconfigured to execute computer program instructions to cause theprocessor to: receive a plurality of sequence reads; create a contigbased on the reads; align the contig to a reference sequence; align theindividual reads back to the contig; and identify a mutation based onthe alignments to the contig and the reference sequence.
 14. The systemof claim 13, wherein the computing device is further configured todivide the reads into subsets.
 15. The system of claim 13, wherein thecomputing device is further configured to determine a position of themutation in the nucleic acid.
 16. The system of claim 13, wherein themutation is selected from the group consisting of: a single nucleotidepolymorphism (SNP), an insertion, a deletion, a substitution, atranslocation, and a copy number variation.
 17. The system of claim 13,wherein the computing device is configured to identify a plurality ofmutations.
 18. The system of claim 17, wherein a first mutation iswithin about 100 nucleotides of a second mutation.
 19. The systemaccording to claim 18, wherein the first mutation is a substitution andthe second mutation is a deletion.
 20. The system of claim 13 whereinthe mutation is a deletion at an end of a sequence read.
 21. The systemof claim 13, wherein the mutation is associated with a disease.
 22. Thesystem according to claim 13, wherein the reads are produced bysequencing-by-synthesis.
 23. The system according to claim 22, whereinsequencing-by-synthesis is single molecule sequencing-by-synthesis.